
docs: move MILESTONES + NORTHSTARS into docs/#313

Merged
trilamsr merged 3 commits into main from docs/move-narrative-to-docs-dir
Jun 1, 2026


@trilamsr trilamsr commented Jun 1, 2026

Contributor

Summary

Audit flagged that MILESTONES.md (108K) and NORTHSTARS.md (28K) lived at the repo root while every other narrative doc (STRATEGY, FAILURE-MODES, maintainership, RELEASE-CHECKLIST, etc.) lives under docs/. Both are project-internal narrative, not user-facing surface, with the same audience and the same lifecycle as their siblings already in docs/. This PR removes that inconsistency by moving both files into docs/.

The repo root (/) is reserved for the conventional top-level files only: README, CONTRIBUTING, CODE_OF_CONDUCT, SECURITY, LICENSE, CHANGELOG, PRINCIPLES, STYLE, AGENTS, CLAUDE.

Changes

  • git mv MILESTONES.md docs/MILESTONES.md
  • git mv NORTHSTARS.md docs/NORTHSTARS.md
  • Inside-file relative links in both moved files: docs/... paths lose their prefix; SECURITY.md, go.mod, CODE_OF_CONDUCT.md, CONTRIBUTING.md, PRINCIPLES.md, STYLE.md gain ../.
  • Cross-file link rewrites: 52 markdown links across README, CONTRIBUTING, AGENTS, docs/README, docs/RELEASE-CHECKLIST, docs/maintainership, docs/patterns/README, docs/followups/{README,M3}, docs/research/{m5-m6,m15,m16}, docs/rfcs/{0002,0003,0008,0009,0010,0011,0012}, components/receivers/pyspy/README, docs/proposals/gen-ai-training-semconv.
  • Load-bearing CI/script paths: scripts/doc-check.sh milestones_doc variable now docs/MILESTONES.md; install/kubernetes/tracecore/Chart.yaml artifacthub URL bumped; .github/{branch-protection.yml, workflows/chaos.yml, ISSUE_TEMPLATE/feature_request.md} comment refs updated.
  • Go-comment factual references in bench/overhead, module/pkg/nccl/fr_parser, module/receiver/ncclfrreceiver, tools/failure-inject brought current.
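
The inside-file rewrite rule above (docs/... links lose their prefix, root-level targets gain ../) can be sketched as a small path computation. This is only an illustration of the rule, not the tooling used in this PR; the function name `rewrite_link` is hypothetical, and it assumes repo-relative POSIX paths.

```python
import posixpath


def rewrite_link(target: str, old_file: str, new_file: str) -> str:
    """Recompute a relative link target after the file containing it moves.

    `old_file` and `new_file` are repo-relative POSIX paths (hypothetical
    helper for illustration). Any #fragment on the link is preserved.
    """
    path, sep, fragment = target.partition("#")
    if not path or "://" in path or path.startswith("/"):
        # Pure anchors, absolute URLs, and root-absolute paths do not depend
        # on where the containing file lives, so they are left untouched.
        return target
    # Resolve the link against the old directory to find its destination.
    resolved = posixpath.normpath(
        posixpath.join(posixpath.dirname(old_file), path)
    )
    # Express that same destination relative to the new directory.
    relative = posixpath.relpath(resolved, posixpath.dirname(new_file) or ".")
    return relative + sep + fragment


old, new = "MILESTONES.md", "docs/MILESTONES.md"
print(rewrite_link("docs/RELEASE-CHECKLIST.md", old, new))  # RELEASE-CHECKLIST.md
print(rewrite_link("SECURITY.md", old, new))                # ../SECURITY.md
```

Resolving against the old location first and only then relativizing against the new one keeps both halves of the rule in one code path, so there is no separate case for "strip docs/" versus "add ../".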

112 remaining bare-text mentions of the filename in prose (e.g., "the MILESTONES.md M15 rubric names...") were left alone — they identify the doc by name, not as a clickable link, and remain readable.
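
To make the distinction between link-syntax references and bare-text mentions concrete, here is a minimal sketch that counts the two separately. The regex and the `classify_mentions` helper are assumptions for illustration, not the check the PR ran, and the regex only covers simple inline `[text](target)` links.

```python
import posixpath
import re

# Simple inline markdown link: [label](target). Reference-style links and
# targets containing spaces or parentheses are out of scope for this sketch.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def classify_mentions(text: str, filename: str) -> tuple[int, int]:
    """Return (link_count, bare_count) for `filename` in markdown `text`."""
    links = [
        target
        for target in LINK_RE.findall(text)
        if posixpath.basename(target.split("#", 1)[0]) == filename
    ]
    # Drop whole links (label and target) so only prose mentions remain.
    prose_only = LINK_RE.sub("", text)
    return len(links), prose_only.count(filename)


sample = (
    "See [the rubric](docs/MILESTONES.md#m15). "
    "The MILESTONES.md M15 rubric names the gate."
)
print(classify_mentions(sample, "MILESTONES.md"))  # (1, 1)
```

Only the first count needs rewriting after a move; the second is prose that names the document and stays valid wherever the file lives.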

docs: move MILESTONES.md and NORTHSTARS.md into docs/ to match the project-internal-narrative convention. Links that pointed at the old repo-root paths were rewritten to the new location; no user-facing API or operator-facing surface changes. CI gates (scripts/doc-check.sh) and the Helm chart artifacthub URL were updated to point at the new location.

Test plan

  • bash scripts/doc-check.sh — passes (554 markdown links resolve, 64 non-md intra-repo paths resolve, 7 unverified markers still detected at the new docs/MILESTONES.md baseline location, banned-phrase lint clean across 114 files, all required top-level docs present)
  • Pre-commit golangci-lint run ./... — 0 issues
  • Pre-commit go vet ./... — clean
  • Pre-commit go mod verify — all modules verified
  • Pre-commit no-autoupdate-check_test — all assertions pass
  • Programmatic link verifier (every [X](path) referencing either moved file): 52/52 link-syntax targets resolve to on-disk files; 0 broken
  • No remaining MILESTONES.md or NORTHSTARS.md at repo root
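
A programmatic link verifier of the kind listed in the test plan could look roughly like the sketch below. It is not the verifier used for this PR; `broken_links` and its parameters are hypothetical, and it assumes simple inline `[X](path)` links with repo-relative POSIX paths for the markdown files.

```python
import os
import posixpath
import re

LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def broken_links(repo_root, markdown_files, moved_names):
    """List (file, target) pairs whose link to a moved doc is not on disk.

    `markdown_files` are repo-relative POSIX paths; `moved_names` is a set of
    base names such as {"MILESTONES.md", "NORTHSTARS.md"}.
    """
    broken = []
    for md in markdown_files:
        with open(os.path.join(repo_root, md), encoding="utf-8") as fh:
            text = fh.read()
        for target in LINK_RE.findall(text):
            path = target.split("#", 1)[0]
            if "://" in path or posixpath.basename(path) not in moved_names:
                continue  # external URL, or not a link to a moved file
            # Resolve the link relative to the directory of the linking file.
            resolved = posixpath.normpath(
                posixpath.join(posixpath.dirname(md), path)
            )
            if not os.path.isfile(os.path.join(repo_root, resolved)):
                broken.append((md, target))
    return broken
```

An empty return value corresponds to the "0 broken" result reported above; any entry names the file that still carries a stale target.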

Audit flagged that MILESTONES.md (108K) and NORTHSTARS.md (28K) lived at
the repo root while every other narrative doc (STRATEGY, FAILURE-MODES,
maintainership, RELEASE-CHECKLIST, etc.) lives under docs/. Both files
are project-internal narrative, not user-facing surface — they belong in
docs/ alongside their siblings.

Mechanical move + cross-reference sweep:

- git mv MILESTONES.md docs/MILESTONES.md
- git mv NORTHSTARS.md docs/NORTHSTARS.md
- inside-file relative-link rewrites in both moved files (docs/* paths
  lose their prefix; ../SECURITY.md, ../go.mod, ../CODE_OF_CONDUCT.md,
  ../CONTRIBUTING.md, ../PRINCIPLES.md, ../STYLE.md gain theirs)
- cross-file link-syntax rewrites: 52 markdown links across README,
  CONTRIBUTING, AGENTS, docs/README, docs/RELEASE-CHECKLIST,
  docs/maintainership, docs/patterns/README, docs/followups/README +
  M3, docs/research/{m5-m6,m15,m16}, docs/rfcs/{0002,0003,0008,0009,
  0010,0011,0012}, components/receivers/pyspy/README, docs/proposals/
  gen-ai-training-semconv
- CI/script load-bearing paths: scripts/doc-check.sh milestones_doc
  variable, install/kubernetes/tracecore/Chart.yaml artifacthub URL,
  .github/{branch-protection.yml,workflows/chaos.yml,ISSUE_TEMPLATE/
  feature_request.md} comments
- Go-comment factual references in bench/overhead, module/pkg/nccl/
  fr_parser, module/receiver/ncclfrreceiver, tools/failure-inject

Files kept at root per the audit charter: README, CONTRIBUTING,
CODE_OF_CONDUCT, SECURITY, LICENSE, CHANGELOG, PRINCIPLES, STYLE,
AGENTS, CLAUDE.

Verification: doc-check.sh passes (554 markdown links resolve, 7
unverified markers still detected at the new path, banned-phrase lint
clean across 114 files). The 112 remaining bare-text mentions of the
filename (e.g., "the MILESTONES.md M15 rubric") are prose references,
not links, and still readable.

Signed-off-by: Tri Lam <tri@maydow.com>
@trilamsr trilamsr enabled auto-merge (squash) June 1, 2026 06:40
Tri Lam added 2 commits May 31, 2026 23:51
…o-docs-dir

# Conflicts:
#	docs/MILESTONES.md
…o-docs-dir

# Conflicts:
#	module/receiver/ncclfrreceiver/doc.go
@trilamsr trilamsr merged commit f1dc099 into main Jun 1, 2026
21 checks passed
@trilamsr trilamsr deleted the docs/move-narrative-to-docs-dir branch June 1, 2026 07:08
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Resolve 5 conflicts post-PR #310 / #312 / #313:
- factory.go deleted on main (merged into patterndetector.go);
  port wave's selftel wiring (#261) into the merged createLogs
- VerdictAttr* unexported per #310; rename 16 wave-added consts
  + all callers across cuda_oom + ib_link_flap + pcie_aer tests
- docs/{MILESTONES,FOLLOWUPS,patterns/README}.md path + content
  reconcile after MILESTONES.md moved to docs/

Address reviewer findings before PR:
- docs/THREAT-MODEL.md case-mismatch -> docs/threat-model.md
  (Linux CI is case-sensitive)
- pattern.id schema drift: 8 specs said `ib_link_flap`/`cuda_oom`,
  code emits "2"/"10"/.../"13"; rewrite spec attribute tables to
  match shipped customer-stable namespace
- pattern.confidence: 8 specs said `high|partial`, code emits
  `full|partial`; rewrite
- 02-ib-link-flap.md attribute drift: spec said
  tracecore.alert.ib_link_flap.{hca_device,port}, code emits
  hw.network.ib.{device,port.num}; align spec to shipped code
- v1-rc1-cut-criteria criterion #1 status stale-on-arrival
  ("6 patterns shipped" -> "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warning when networkPolicy.enabled=true
  with empty allowedEgressEndpoints (silently kills OTLP exporter)
  + warning when ServiceMonitor scraper in different namespace
- File #337 for missing OTTL recipe projecting DCGM FB_USED/FREE
  -> hw.gpu.memory.{free,total} log shape (CUDA OOM detector
  consumes but recipe gap means it ships dark)

Tests: ./module/processor/patterndetectorprocessor/... +
./module/pkg/patterns/... both ok.

Signed-off-by: Tri Lam <tri@maydow.com>
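
The docs/THREAT-MODEL.md versus docs/threat-model.md finding in the commit above is the kind of drift that passes on a case-insensitive filesystem and fails on Linux CI. A minimal sketch of an exact-case existence check follows; `exact_case_exists` is a hypothetical helper, and it assumes an already-normalized repo-relative path (no `..` segments).

```python
import os


def exact_case_exists(repo_root: str, rel_path: str) -> bool:
    """True only if every path component matches the on-disk name exactly.

    Comparing against os.listdir() output gives the same answer on
    case-insensitive filesystems, where os.path.exists() would accept
    a differently-cased name.
    """
    current = repo_root
    for part in rel_path.split("/"):
        if part in ("", "."):
            continue
        try:
            names = os.listdir(current)
        except (FileNotFoundError, NotADirectoryError):
            return False
        if part not in names:
            return False
        current = os.path.join(current, part)
    return True
```

Running a check like this over link targets on a developer laptop would surface the mismatch before it reaches CI.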
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
The audit docs were authored when NORTHSTARS.md + MILESTONES.md lived
at the repo root. main moved them to docs/ in PR #313 just before this
wave landed. Sibling docs reference these by relative path; 22 links
were stale. Replaced ../{NORTHSTARS,MILESTONES}.md → {NORTHSTARS,MILESTONES}.md
across three files. doc-check passes.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
After PR #313 moved NORTHSTARS.md into docs/, the Spec column links
added in the pattern-spec commit kept the pre-move ../docs/patterns/
prefix; from docs/NORTHSTARS.md the correct relative path is just
patterns/. 12 links fixed; doc-check clears.

Signed-off-by: Tri Lam <tri@maydow.com>
trilamsr pushed a commit that referenced this pull request Jun 1, 2026
Authoring drift: the v1-rc1-cut-criteria bullets pre-existed the
PR #313 MILESTONES.md → docs/MILESTONES.md move, so links carried
docs/ prefix that now double-resolves to docs/docs/. Strip to
sibling-relative.

Signed-off-by: Tri Lam <tri@maydow.com>
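
The stale-path commits above describe two failure shapes: a leftover docs/ prefix that resolves to docs/docs/, and a leftover ../ that points back at the repo root. The sketch below shows how each target resolves once the linking file lives in docs/. The `resolve` helper is hypothetical, and which file carried which link is an assumption made for illustration.

```python
import posixpath


def resolve(link: str, containing_file: str) -> str:
    """Resolve a relative link against the directory of its containing file."""
    return posixpath.normpath(
        posixpath.join(posixpath.dirname(containing_file), link)
    )


# A docs/ prefix written while the file sat at the repo root now doubles up.
print(resolve("docs/v1-rc1-cut-criteria.md", "docs/MILESTONES.md"))
# docs/docs/v1-rc1-cut-criteria.md
print(resolve("v1-rc1-cut-criteria.md", "docs/MILESTONES.md"))
# docs/v1-rc1-cut-criteria.md

# A ../ written by a docs/ sibling still points at the old root location.
print(resolve("../MILESTONES.md", "docs/v1-rc1-cut-criteria.md"))
# MILESTONES.md
print(resolve("MILESTONES.md", "docs/v1-rc1-cut-criteria.md"))
# docs/MILESTONES.md
```

In both cases the fix is the sibling-relative form, since after the move the linking file and its target share the docs/ directory.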
trilamsr added a commit that referenced this pull request Jun 1, 2026
…ts (#338)

## Summary

15-agent parallel wave bridging v1.0-rc1 knowledge gaps + closing
horizon backlog. 31 commits, 81 files, +8650/-180.

**Code (5 detectors / features):**
- `feat(iblinkflap)` pattern #2 IB link flap detector — 13 tests,
cross-rank helper extracted for reuse by patterns #7/#9
- `feat(cudaoom)` pattern #10 CUDA OOM detector +
fragmentation-vs-true-OOM discriminator — 35 tests, 0/6 false-positive
rate on fixture corpus (#303 wiring — recipe gap tracked at #337)
- `feat(verdict)` deprecate EvictedPod, co-emit PodName + PodNamespace
(#277) with regression-pinning test
- `feat(chart)` opt-in default-deny NetworkPolicy + cert-manager mTLS
reference (#301); ServiceMonitor + scrape annotations (#296); NOTES.txt
UX warnings for empty-egress / cross-ns scraper traps
- `feat(bench)` per-detector allocs/event harness + soft ratchet gate,
graduation criterion documented (#302)
- `feat(patterndetector)` verdict counter metric for dashboard panels
(#261)
- `fix(slo-rules)` correct otelcol_* label set + drop silent-no-op
`unless on (instance)` join (#298)

**8 pattern design specs (`docs/patterns/{02,07-13}-*.md`):**
- Per pattern: symptom, layers crossed, signal sources, detector
evaluation rule, verdict attrs, edge cases, open questions.
- 7 load-bearing spec gaps flagged for future TDD red-test work
(multi-vendor SDC signal, cohort grouping, processor metrics path, etc).

**9 v1.0-rc1 audit / knowledge-gap docs:**
- `docs/v1-rc1-cut-criteria.md` — 12 falsifiable cut gates derived from
O1-O7
- `docs/v1-rc1-operational-gaps.md` — SLSA L3 + air-gap +
upgrade-rollback audit (8 issues filed #314-#321)
- `docs/v1-rc1-governance-gaps.md` — CODEOWNERS 0%, lint-principles
4/16, retros, `make ci` 148s (5 issues #322-#325, #327)
- `docs/v1-rc1-test-audit.md` — 82.9% coverage, fuzz harness inventory
(5 issues #328-#332)
- `docs/v1-rc1-simplification-audit.md` — top deletion candidates ~9.6K
LOC (3 issues #333-#335)
- `docs/threat-model.md` — STRIDE per trust boundary + audit RFP scope
(#336)
- `docs/reference-environments.md` — Tier 1 kind + Tier 2 32×H100
binding spec for O2 hero KPI
- `docs/adoption-pipeline.md` — S0-S3 funnel + comms templates for O5
hero KPI
- `docs/standards-roadmap.md` — 10 `gen_ai.training.*` attributes
proposed upstream (#326)

**Doc-drift cleanup:** 11 issues closed (#265, #268, #269, #276, #283,
#287, #292-295, #299).

**OTTL recipe wiring:** 6 issues closed (#260, #261, #273, #282, #284,
#285); #272 deferred to standards-roadmap.

**Multi-cluster auth:** bearer-token + mTLS examples (#297).

**Merge resolution + reviewer fixes:**
- Resolved 5 conflicts post-PR #310/#312/#313 (factory.go delete,
VerdictAttr* unexport, MILESTONES.md → docs/, FOLLOWUPS, patterns
README)
- Adversarial reviewer found 1 BLOCKER + 6 MAJOR; all addressed before
push:
  - Renamed 16 `VerdictAttr*` → `verdictAttr*` per #310 convention
  - Re-ported selftel wiring (#261) into main's merged `createLogs`
- Fixed case-mismatch `docs/THREAT-MODEL.md` → `docs/threat-model.md`
(Linux CI is case-sensitive)
- 8 pattern specs schema drift: `pattern.id` slug → numeric (`"2"`,
`"7"`...`"13"`), `pattern.confidence` `high` → `full`
- `02-ib-link-flap.md` attribute drift: spec said
`tracecore.alert.ib_link_flap.{hca_device,port}`, code emits
`hw.network.ib.{device,port.num}`
- `v1-rc1-cut-criteria` criterion #1 status stale-on-arrival ("6
patterns shipped" → "8 patterns shipped, 4 remaining")
- NetPol UX trap: NOTES.txt warns when `enabled=true` with empty
`allowedEgressEndpoints` (silently kills OTLP) or cross-ns Prometheus
- Filed #337 for missing OTTL recipe projecting `DCGM_FI_DEV_FB_*` →
`hw.gpu.memory.{free,total}` (CUDA OOM detector consumes but recipe gap)
- Post-merge stale-relative-path sweep: 6 wave docs + NORTHSTARS.md +
MILESTONES.md (`docs/`, `../`, `docs/docs/` drift after MILESTONES +
NORTHSTARS moved to docs/)
- Documented 5 newly-emitted attributes in ATTRIBUTES.md (drop_ratio +
IB tier — `attribute-namespace-check` now 67/67)

## Test plan

- [x] `go test ./module/processor/patterndetectorprocessor/...
./module/pkg/patterns/...` — ok
- [x] `make lint` (golangci-lint via goreleaser-style gate) — 0 issues
- [x] `go vet ./...` — clean
- [x] `make doc-check` — passes after stale-link sweep
- [x] `scripts/attribute-namespace-check.sh` — 67/67 documented
- [x] `helm lint install/kubernetes/tracecore` — 0 chart(s) failed
- [x] `promtool check rules` on slo-rules.yaml — 13 rules / SUCCESS
- [ ] CI compat-matrix (rc1 criterion #6) — gated on next wave
- [ ] manual smoke install on real cluster — owner clearance pending

```release-notes
Lands two new pattern detectors (#2 IB link flap, #10 CUDA OOM
fragmentation-vs-true discriminator), 8 pattern design specs for the
remaining v1.0 root-cause patterns, opt-in default-deny NetworkPolicy
+ Prometheus Operator ServiceMonitor on the Helm chart, the
EvictedPod → PodName/PodNamespace verdict-attribute deprecation
co-emit, per-detector allocs/event bench harness, SLO-rules label
fix, and the v1.0-rc1 knowledge-gap audit set (cut criteria, ops gaps,
governance gaps, test audit, simplification audit, threat model,
reference envs, adoption pipeline, standards roadmap).
```

---------

Signed-off-by: Tri Lam <tri@maydow.com>
Co-authored-by: Tri Lam <tri@maydow.com>
